Prosper Loan Data Analysis by Jason Yang

In this analysis, we explore loan data from Prosper. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

Univariate Plots Section

'data.frame':   113937 obs. of  81 variables:
 $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
 $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
 $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
 $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
 $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
 $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
 $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
 $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
 $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
 $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
 $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
 $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
 $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
 $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
 $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
 $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
 $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
 $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
 $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
 $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
 $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
 $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
 $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
 $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
 $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
 $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
 $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
 $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
 $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
 $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
 $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
 $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
 $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
 $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
 $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
 $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
 $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
 $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
 $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
 $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
 $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
 $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
 $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
 $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
 $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
 $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
 $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
 $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
 $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
 $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
 $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
 $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
 $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
 $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
 $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
 $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
 $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
 $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
 $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
 $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
 $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
 $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
 $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
 $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
 $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
 $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
 $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
 $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
 $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
 $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Our dataset consists of 81 variables and about 114,000 observations.

To start the investigation, I wanted to explore what type of customers Prosper makes loans to. Plotting the income range shows that most borrowers have incomes between $25,000 and $75,000. I wonder what this plot looks like across the other categorical variables: employment status, homeownership, and loan status.

levels(prosperLoanData$LoanStatus) <- c('Cancelled','Chargedoff', 'Completed', 
                                        'Current','Defaulted',
                                        'FinalPaymentInProgress','Past Due',  
                                        'Past Due', 'Past Due','Past Due', 
                                        'Past Due', 'Past Due')

levels(prosperLoanData$LoanStatus)
[1] "Cancelled"              "Chargedoff"            
[3] "Completed"              "Current"               
[5] "Defaulted"              "FinalPaymentInProgress"
[7] "Past Due"              

I combined all of the Past Due statuses into a single "Past Due" level.

Most borrowers are employed; this categorical variable has 2,255 missing values. The number of borrowers who own a home is about the same as the number who do not. The majority of loans have a status of Completed or Current.
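The str() output above lists "" as a level of EmploymentStatus, which suggests the missing values are stored as empty strings rather than NA. The sketch below shows one way to count and recode them; the toy vector `status` is hypothetical and only stands in for prosperLoanData$EmploymentStatus.

```r
# Hypothetical toy vector standing in for prosperLoanData$EmploymentStatus.
status <- factor(c("Employed", "", "Full-time", "Employed", ""))

# Count the blank entries before recoding them.
n_blank <- sum(status == "")

# Recode blanks as NA and drop the now-unused "" level.
status[status == ""] <- NA
status <- droplevels(status)

# Frequency table that keeps the missing values visible.
table(status, useNA = "ifany")
```

Recoding blanks as NA keeps them out of bar charts by default while still letting table() report them.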

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   669.5   689.5   695.1   729.5   889.5     591 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1000    4000    6500    8337   12000   35000 

The lowest APR was 0.00653 and the highest was 0.512. The mean APR of 0.219 was higher than the median of 0.210. Visually, the Borrower APR distribution is skewed slightly to the right, with two local peaks to the right of 0.25. I wonder how higher APR is connected to homeownership and employment status.
To analyze credit score, I introduced a new variable called “CreditScoreAverage”: the average of the upper and lower credit score bounds in our dataset. I trimmed the graph to the range 450 to 850, which is where most of the data lies. Most borrowers had a credit score of around 690.
For the original loan amount, common values are multiples of $1,000 (e.g. $1,000; $2,000; etc.) at the low end and multiples of $5,000 up to $35,000. The median original loan amount was $6,500.
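As a minimal sketch of how a variable like CreditScoreAverage can be derived, the snippet below averages the lower and upper bounds on a toy data frame. The data frame `loans` is a hypothetical stand-in for prosperLoanData; its first four rows reuse values from the str() output above, and the NA row is added for illustration.

```r
# Hypothetical toy data frame standing in for prosperLoanData.
loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, 800, NA),
  CreditScoreRangeUpper = c(659, 699, 499, 819, NA)
)

# Midpoint of the reported credit score range.
loans$CreditScoreAverage <-
  (loans$CreditScoreRangeLower + loans$CreditScoreRangeUpper) / 2

# Missing ranges propagate to NA, which summary() counts separately.
summary(loans$CreditScoreAverage)
```

Because each range spans 19 points, every average ends in .5, which matches the values in the summaries.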

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 81 variables and about 114,000 observations. Borrowers are categorized by income range, homeownership, and loan status.
Observations:
- The numbers of homeowners and non-homeowners are about the same.
- The median credit score is about 690.
- The median and mean APR are 0.210 and 0.219, respectively.
- Most loans have a status of Completed or Current.
- A majority of borrowers take out loans below $10,000.

What is/are the main feature(s) of interest in your dataset?

For me, the main features of interest are the borrower’s credit score, original loan amount, and APR. I suspect that a borrower’s APR can be modeled using the credit score in combination with other variables. I think APR is heavily dependent on credit score and loan amount, but other categorical variables may reveal additional relationships.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other features of interest include homeownership, employment status, and income range. It would also be interesting to use the loan status feature to see which loans are delinquent (charged off, defaulted, past due).

Did you create any new variables from existing variables in the dataset?

For the univariate analysis, I combined all Past Due statuses into a single Past Due status. I also created the “CreditScoreAverage” variable.

Bivariate Plots Section

As one would expect, higher credit scores are usually associated with lower APR. However, credit score does not seem to be the only deciding factor when determining APR. The graph shows distinct horizontal lines at 0.3 and 0.375 APR. I think plotting borrower APR against loan amount will help.

There is a weak correlation between loan amount and borrower APR. For loans above $30,000, APR is more likely to be between 0.1 and 0.2, but I think this has more to do with the type of borrower who takes out large loans. Regardless, APR tends to go down as the loan amount goes up, which makes sense: mortgage interest rates are usually lower than credit card interest rates.
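One way to put a number on the relationship between loan amount and APR is a Pearson correlation coefficient. The sketch below uses a hypothetical toy data frame `loans` with made-up values, not the Prosper data, so the resulting coefficient only illustrates the direction of the relationship, not its strength in the real dataset.

```r
# Hypothetical toy values, NOT taken from the Prosper dataset.
loans <- data.frame(
  LoanOriginalAmount = c(2000, 4000, 6500, 10000, 15000, 25000, 35000),
  BorrowerAPR        = c(0.35, 0.30, 0.22, 0.24, 0.18, 0.15, 0.12)
)

# Pearson correlation; "complete.obs" drops rows with missing values.
r <- cor(loans$LoanOriginalAmount, loans$BorrowerAPR, use = "complete.obs")
r
```

A negative coefficient means APR falls as the loan amount rises; on the full dataset, cor.test() would additionally report a confidence interval.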

prosperLoanData$IncomeRange: $0
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   649.5   689.5   695.7   749.5   869.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: $1-24,999
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   649.5   689.5   681.2   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: $100,000+
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   689.5   709.5   720.4   749.5   889.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: $25,000-49,999
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   649.5   689.5   689.5   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: $50,000-74,999
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   669.5   689.5   701.2   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: $75,000-99,999
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   669.5   709.5   709.8   749.5   889.5 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: Not displayed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   549.5   609.5   610.7   689.5   889.5     591 
-------------------------------------------------------- 
prosperLoanData$IncomeRange: Not employed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   669.5   709.5   704.0   749.5   849.5 

Credit score had less dependency on income range than I expected. All income ranges except “Not displayed” had a median credit score average of 689.5 or 709.5.
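The grouped summaries above have the shape that base R's by() prints. The sketch below reproduces that kind of output and extracts the per-group medians with tapply(); the exact call used in the analysis is an assumption, and the toy data frame `loans` is hypothetical.

```r
# Hypothetical toy data frame standing in for prosperLoanData.
loans <- data.frame(
  IncomeRange = factor(c("$1-24,999", "$1-24,999",
                         "$100,000+", "$100,000+", "$100,000+")),
  CreditScoreAverage = c(649.5, 689.5, 709.5, 749.5, 689.5)
)

# One summary() per income range, printed group by group.
by(loans$CreditScoreAverage, loans$IncomeRange, summary)

# Just the medians, as a named vector that is easy to compare.
med <- tapply(loans$CreditScoreAverage, loans$IncomeRange,
              median, na.rm = TRUE)
med
```

Pulling the medians into a single named vector makes it easier to see that they cluster at only a few distinct values.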

I decided to plot IncomeRange vs BorrowerAPR as a point plot instead of a boxplot to visualize where most loans are made. There’s a distinct line of loans made at 0.35 APR across the income ranges from $1 to $100,000+. Let’s visualize what size of loan people within each income range are taking out.

Most loans were taken out by borrowers in the $25,000 to $75,000 income range. Borrowers with higher incomes are more likely to take out larger loans.

prosperLoanData$IsBorrowerHomeowner: False
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   649.5   689.5   675.3   709.5   869.5     591 
-------------------------------------------------------- 
prosperLoanData$IsBorrowerHomeowner: True
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    9.5   669.5   709.5   714.3   749.5   889.5 

Borrowers who are homeowners have slightly higher credit scores (median=709.5) than borrowers who do not own homes (median=689.5). Although the medians are similar, the two groups’ means differ much more: homeowners have a mean credit score of 714.3, and non-homeowners have a mean score of 675.3. I think we’ll be able to see this visually when we plot credit score vs. APR with homeownership as context in the multivariate section.

prosperLoanData$EmploymentStatus: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   549.5   609.5   607.1   669.5   869.5     589 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Employed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  609.5   669.5   709.5   708.8   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Full-time
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   629.5   669.5   678.9   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Not available
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   549.5   609.5   609.7   689.5   889.5       2 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Not employed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   669.5   709.5   703.0   749.5   849.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Other
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  609.5   669.5   689.5   704.7   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Part-time
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   629.5   669.5   668.9   709.5   889.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Retired
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   649.5   689.5   696.0   749.5   889.5 
-------------------------------------------------------- 
prosperLoanData$EmploymentStatus: Self-employed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  529.5   669.5   709.5   709.2   749.5   869.5 

It looks like credit scores vary much more with employment status than with income range. I’m surprised to see that the mean credit scores of employed and not employed borrowers are about the same (708.8 and 703.0) and are above that of full-time borrowers (678.9).

prosperLoanData$LoanStatus: Cancelled
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1466  0.1622  0.2074  0.2058  0.2565  0.2565 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Chargedoff
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01823 0.19003 0.26271 0.25775 0.32958 0.46201 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Completed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.00653 0.13271 0.19479 0.20878 0.28498 0.51229      25 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Current
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.06106 0.15833 0.20524 0.21374 0.26528 0.35838 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Defaulted
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00864 0.17722 0.24001 0.23893 0.29776 0.50633 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: FinalPaymentInProgress
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.06888 0.15833 0.22362 0.22956 0.31032 0.35797 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Past Due
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.06327 0.21827 0.27285 0.26738 0.32576 0.38486 

A borrower’s APR is determined before the loan is made. It’s interesting to see that the bad loan statuses (charged off, defaulted, past due) have higher APR than the good loan statuses (completed, current, final payment in progress). I wonder whether higher APR was knowingly assigned to borrowers expected to have trouble paying back, or whether the higher APR itself led to the bad loan statuses.
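The good-versus-bad comparison can be made explicit by collapsing the statuses into two groups. In the sketch below, the column name `StatusGroup` and the toy data frame `loans` are hypothetical; the APR values are the per-status medians from the listing above, rounded to two decimals, with one row per status.

```r
# Hypothetical toy data: one row per status, rounded median APRs from above.
loans <- data.frame(
  LoanStatus  = c("Completed", "Current", "FinalPaymentInProgress",
                  "Chargedoff", "Defaulted", "Past Due"),
  BorrowerAPR = c(0.19, 0.21, 0.22, 0.26, 0.24, 0.27)
)

# Statuses treated as "bad" in this analysis.
bad <- c("Chargedoff", "Defaulted", "Past Due")
loans$StatusGroup <- ifelse(loans$LoanStatus %in% bad, "bad", "good")

# Median APR per group.
group_median <- tapply(loans$BorrowerAPR, loans$StatusGroup, median)
group_median
```

On the full dataset the same grouping would be applied row by row, so the group medians would be weighted by the number of loans in each status.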

prosperLoanData$LoanStatus: Cancelled
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  509.5   524.5   589.5   604.5   669.5   729.5       1 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Chargedoff
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   609.5   669.5   658.4   709.5   869.5      48 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Completed
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   649.5   689.5   695.1   749.5   889.5     416 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Current
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  609.5   669.5   709.5   708.2   729.5   889.5 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Defaulted
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    9.5   569.5   649.5   630.4   689.5   869.5     126 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: FinalPaymentInProgress
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  609.5   669.5   709.5   709.9   749.5   829.5 
-------------------------------------------------------- 
prosperLoanData$LoanStatus: Past Due
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  609.5   669.5   689.5   697.6   729.5   869.5 

As expected from our previous observation of LoanStatus vs BorrowerAPR, charged-off and defaulted loans have lower credit scores than loans with good statuses, while past-due loans (median 689.5) are on par with completed loans. The cancelled status stands apart with the lowest credit scores of all (median 589.5).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The relationship between credit score and borrower APR was as expected: higher credit scores were associated with lower APR. The relationship between loan amount and borrower APR was weaker than I expected. For some reason, I was expecting that larger loans would mean higher APR, but it turned out to be the reverse: larger loan amounts correlated with lower APR. This makes sense when I think about it, since a house mortgage usually has a lower interest rate than a credit card.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The most interesting relationship observed was that bad loan statuses are mostly associated with high APR. Borrowers with high APR mostly have lower credit scores, so in effect loan issuers know that the borrower may have trouble paying the loan back and issue the loan at a higher APR. However, I wonder how much of the bad loan status was the result of the bad credit score, and how much was due to those borrowers having to pay more because of the higher APR.

What was the strongest relationship you found?

The strongest relationship was between credit score and borrower APR.

Multivariate Plots Section

The legend came out faint. True is cyan and False is salmon. Homeowners have higher credit scores and lower APR.

Now this graph is interesting. I expected the loan amount and APR to be fairly even between homeowners and non-homeowners up to $20,000. It turns out that small loans are mostly taken out by non-homeowners and large loans by homeowners. The lowest rates are largely given to homeowners.

As expected, homeowners have a lower overall APR across all loan statuses.

Another view of IncomeRange vs. credit score. Homeowners have higher credit scores and higher incomes.

Large loans are mostly taken out by borrowers with higher credit scores, which are associated with lower APR. Larger loan amounts are themselves also associated with lower APR.

As an aside, I wanted to see how Prosper scores its borrowers. As expected, low APRs are given to borrowers with high Prosper scores. Having a high credit score could help in getting a low APR, but there seem to be other variables at play when determining the Prosper score. The gray parts of the graph are data points without a Prosper score; not all loans had a Prosper score tied to them. From the graph, it looks like borrowers with a credit score below 600 do not have a Prosper score. Maybe if your credit score is below a certain threshold, other variables take precedence over the Prosper score in determining APR.
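The visual impression that borrowers below a credit score of 600 lack a Prosper score could be checked numerically by comparing the share of missing scores on each side of the threshold. The sketch below is only an illustration: the toy data frame `loans` and the column `Below600` are hypothetical, and the values are constructed to mirror the pattern described, so they are not evidence for it.

```r
# Hypothetical toy data constructed to mirror the pattern in the plot.
loans <- data.frame(
  CreditScoreAverage = c(489.5, 549.5, 589.5, 649.5, 689.5, 729.5),
  ProsperScore       = c(NA, NA, NA, 4, 7, 9)
)

# Flag borrowers below the 600 threshold.
loans$Below600 <- loans$CreditScoreAverage < 600

# Share of missing Prosper scores within each group
# (the mean of a logical vector is the proportion of TRUE values).
missing_share <- tapply(is.na(loans$ProsperScore), loans$Below600, mean)
missing_share
```

A share near 1 for the TRUE group and near 0 for the FALSE group on the real data would support the observation from the graph.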

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The “IsBorrowerHomeowner” feature was especially helpful in providing additional context to the plots. From the bivariate analysis, we determined that homeowners have higher credit scores, take out larger loans, and have lower APR. With the addition of “IsBorrowerHomeowner”, it is easier to visualize the characteristics of a homeowner.

Were there any interesting or surprising interactions between features?

I found it interesting that the lowest APR was given to homeowners regardless of loan amount. I expected loans below $5,000 to have about the same APR for homeowners and non-homeowners.


Final Plots and Summary

Plot One

Description One

Bad loan statuses are correlated with higher APR. From this graph, we see bad loan statuses (Chargedoff, Defaulted, Past Due) have higher APR than good loan statuses (Completed, Current, FinalPaymentInProgress). Cancelled is a neutral status because the borrower decided not to take out the loan.

Plot Two

Description Two

Having a high credit score and owning a home are both correlated with lower APR. A majority of borrowers with low credit scores do not own a home.

Plot Three

Description Three

Small loans are mostly taken out by non-homeowners and large loans mostly by homeowners. The lowest rates are largely given to homeowners regardless of loan amount. High APRs (around 0.35) are more common for loans below $10,000.


Reflection

The Prosper loan dataset is very detailed, with many features to analyze. Not coming from a finance background, I had to read up on quite a few terms and concepts. Something that should be obvious, like lower APR for larger loan amounts, was not immediately obvious to me. I think that’s one of the challenges of being a data analyst: we have to become familiar with the data before we can ask the right questions.
In my analysis, I’ve explored only a small set of features from this dataset, and there’s much more to analyze. Future work would be to explore Prosper’s own ratings of borrowers. In the multivariate section, I explored the Prosper score feature and saw that borrowers with credit scores below 600 were not given a Prosper score. I’m curious to see what other methodologies Prosper uses to determine a borrower’s APR.